Skip to content

Conversation

@jocelynj
Copy link

Hi,

This commit adds a new config "full-metadata" to dump all Info fields from the PBF format, if they are available. I’ve kept them as Option<> to align with how the proto file is defined - it defines most of the fields as optional.

New available fields:

  • version
  • timestamp, as an i64 number, in milliseconds since 1970 epoch
  • changeset
  • user id
  • user name
  • visible, set to true by default

This is greatly inspired from PR ##40, with the addition of more fields.

I’ve launched some basic read tests using par_iter(), with/without the config on France pbf, which is 4.4GB big. To read all nodes/ways/relations, without any additional computation, on a 2*6-cores cpu, it takes:

  • 18 s ± 0.2 s without metadata
  • 24 s ± 2.9 s with metadata

I think it makes sense to keep the configurability, to not impact performance, but I can remove it if you prefer.

…bf file

With cfg='full-metadata' enabled, these fields are available in an
'info' field of Node/Way/Relation object:
  - version
  - timestamp, as an i64 number, in milliseconds since 1970 epoch
  - changeset
  - user id
  - user name
  - visible, set to true by default

This is greatly inspired from PR #TeXitoi#40, with the
addition of more fields.
src/objects.rs Outdated
pub decimicro_lon: i32,
#[cfg(feature = "full-metadata")]
/// Additional metadata
pub info: Option<Info>,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a field in a struct with all field public break the "adding a feature is not breaking" contract. For example, if someone does

let Node { id, tags, decimicro_lat, decimicro_lon } = node;

this code will not compile if the feature is activated.

We can keep the field and only populate it with an parameter, but the Info struct will make the different structs significantly bigger. We can use pub info: Option<Box<Info>>, to limit the augmented size, but it will impact the performance with and without the Info retrieval.

What do you propose?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know of the rule that enabling a feature shouldn't break logic, but it makes sense.

There is also the possibility of adding functions like iter_with_metadata, but it might mean to duplicate most of the logic. Or maybe we could play with generics, but might be too complex.

I will do some trials with your proposition of info: Option<Box<Info>>.

In any case, does it mean that there will be a need to bump the major version of the package? My original thought was that adding a new feature would help to not break backward compatibility.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we modify a struct with all pub field, yes, it will need a version increment.

Another possiblitity would be to encapsulate in the other way:

struct WithInfo<T> {
    obj: T,
    info: Option<Info>,
}

And iterate on it, with dedicated methods.

Copy link
Author

@jocelynj jocelynj Apr 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have launched examples/count.rs on france.osm.pbf, with Option<Box<Info>>, and it is slower than expected. These stats are generated with hyperfine

  • main
    Time (mean ± σ): 18.375 s ± 1.249 s [User: 129.637 s, System: 4.216 s]
    Range (min … max): 17.712 s … 21.905 s 10 runs

  • this version, with full-metadata off
    Time (mean ± σ): 18.196 s ± 0.307 s [User: 130.923 s, System: 4.132 s]
    Range (min … max): 17.900 s … 18.967 s 10 runs

  • this version, with full-metadata on
    Time (mean ± σ): 24.424 s ± 0.363 s [User: 181.625 s, System: 5.445 s]
    Range (min … max): 24.039 s … 25.108 s 10 runs

  • this version, but using NonZero<> in Info, with full-metadata on
    Time (mean ± σ): 21.686 s ± 0.255 s [User: 152.626 s, System: 4.924 s]
    Range (min … max): 21.476 s … 22.326 s 10 runs

  • Option<Box<Info>>, with full-metadata off
    Time (mean ± σ): 19.162 s ± 0.440 s [User: 132.996 s, System: 4.502 s]
    Range (min … max): 18.824 s … 20.170 s 10 runs

  • Option<Box<Info>>, with full-metadata on
    Time (mean ± σ): 35.523 s ± 0.286 s [User: 160.999 s, System: 5.323 s]
    Range (min … max): 35.129 s … 35.998 s 10 runs

  • Option<Box<Info>>, using NonZero<> in Info, with full-metadata on
    Time (mean ± σ): 36.360 s ± 0.513 s [User: 162.979 s, System: 4.954 s]
    Range (min … max): 35.727 s … 37.125 s 10 runs

What would you think of making the info field private, with a size depending on the feature full-metadata, and adding a function get_info() to get its value?

New functions are added to get full metadata: iter_with_metadata() and
par_iter_with_metadata(), with same behaviour than iter() and
par_iter(). They return ObjInfo objects, which is a node/way/relation
with an optional Info field.
@jocelynj
Copy link
Author

I’ve changed the code to add a NodeInfo/WayInfo/RelationInfo structure, containing a separate Node/Way/Relation+Info fields. The previous functions iter() and par_iter() still return normal objects, and I’ve added new functions iter_with_metadata() and par_iter_with_metadata() to return objects with an Info field. I had to duplicate some logic, but maybe it would be better to make all the logic generic on Node/Way/Relation, as I did in src/groups.rs. I’m not a rust specialist, so the code might need some improvements :)

I’ve also modified Info to use Option<NonZero<...>>.

On the new code, performance is very good:

  • full-metadata off
    Time (mean ± σ): 17.952 s ± 0.102 s [User: 128.248 s, System: 4.136 s]
    Range (min … max): 17.814 s … 18.108 s 10 runs

  • full-metadata on, using par_iter()
    Time (mean ± σ): 18.551 s ± 0.115 s [User: 141.156 s, System: 4.206 s]
    Range (min … max): 18.424 s … 18.700 s 10 runs

  • full-metadata on, using par_iter_with_metadata()
    Time (mean ± σ): 21.294 s ± 0.063 s [User: 147.180 s, System: 4.691 s]
    Range (min … max): 21.211 s … 21.386 s 10 runs

I think it is still worth keeping the optional feature, but we could remove it - I suspect that the additional logic in DenseNodes to update the denseinfo structure is the culprit for the small decrease in performance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants